ABSTRACT

A systematic approach is presented for 
designing systolic arrays and their equivalent 
configurations for certain general classes of 
recursively formulated algorithms. A new method is 
also introduced to reduce the input bandwidth and 
storage requirements of the systolic arrays through 
the study of dependence among the input data. Many 
well-known systolic arrays can be rederived, and
many new systolic arrays can be discovered, by
this approach.


I. INTRODUCTION


A systolic array is a network of processors
that rhythmically process and pass data among
themselves. It provides pipelining, parallelism,
and a simple adjacent-neighbor cell interconnection
structure, which makes it suitable for VLSI
implementation. While most of the earlier systolic
array algorithms were discovered heuristically
[1-3], there has been various work on systematic
approaches to the design of systolic array
algorithms [4-6]. In this paper, we present
a systematic approach for designing systolic arrays,
focusing especially on their equivalent
configurations for certain general classes of
recursively formulated algorithms. In order to
reduce the input bandwidth and storage requirements
of the systolic arrays, the dependence among the
input data is also investigated in detail. It is
shown that many well-known systolic arrays can be
rederived, and many new systolic arrays can be
discovered, by this systematic approach. For
simplicity of illustration, we mainly consider the
linear systolic array in this paper. The same idea
can also be generalized to two-dimensional
mesh-connected systolic arrays.

II. IMPLEMENTATION OF RECURSIVELY 
FORMULATED ALGORITHMS 


Consider two simple but important data flow
patterns in a linear systolic array, as shown in
Figures 1 and 2. In these two figures, P_i, Q_j, and
b_{ij} are three given input data sequences and R_j
is the output data sequence, where 0 ≤ i ≤ m-1 and
0 ≤ j ≤ n-1. For the systolic array shown in Figure
1, Q_j and R_j are stored in the j-th processor, where
R_j is updated while P_i moves to the right
and b_{ij} moves down. For the systolic array
shown in Figure 2, P_i is stored in the i-th
processor, and R_j is updated as it moves to
the right with Q_j while b_{ij} moves down. All of
the data movements are synchronized. The R_j's will
successively have the required output data after m
steps. For convenience, according to the behavior
of the R_j's, these two systolic arrays are
named the R-stay and R-move linear systolic
arrays, respectively. There is great similarity
between these two systolic arrays. It can be shown
that a large class of interesting practical problems
can be implemented by these two types of
linear systolic arrays. Moreover, various different
but equivalent configurations of linear systolic
arrays can be derived from them.


Procedure 1: Given any problem which can be
formulated so that it has P_i, Q_j, and b_{ij} as three
input data sequences and R_j as the output data
sequence, where 0 ≤ i ≤ m-1 and 0 ≤ j ≤ n-1, if R_j can
be generated through the following recurrence
equation

    R_j^{(i)} = f(P_i, Q_j, b_{ij}; R_j^{(i-1)})        (1)

where R_j^{(-1)} contains some initial value, f is any
function of the four variables P_i, Q_j, b_{ij}, and
R_j^{(i-1)}, and R_j^{(m-1)} is the required output data
R_j, then this problem can be implemented by the
R-stay linear systolic array of n processors and the
R-move linear systolic array of m processors. □
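
As an illustration of Procedure 1, the following minimal
Python sketch evaluates recurrence (1) directly and then
replays it under the R-stay and R-move schedules, one
loop iteration per time step. The function names, the
list-based model of the pipeline, and the assumption that
P_i (respectively R_j) advances exactly one cell per step
are illustrative choices and are not taken from the figures.

```python
def recurrence_reference(f, P, Q, b, R_init):
    """Evaluate equation (1) directly, one index i at a time."""
    R = list(R_init)                       # R_j^(-1)
    for i in range(len(P)):
        for j in range(len(Q)):
            R[j] = f(P[i], Q[j], b[i][j], R[j])
    return R                               # R_j^(m-1)


def r_stay(f, P, Q, b, R_init):
    """R-stay: cell j keeps Q_j and R_j; P_i shifts right one cell per step."""
    m, n = len(P), len(Q)
    R = list(R_init)
    pipe = [None] * n                      # index i of the P value held by each cell
    for t in range(m + n - 1):
        pipe = [t if t < m else None] + pipe[:-1]
        for j, i in enumerate(pipe):
            if i is not None:              # cell j meets P_i and b_ij at step t = i + j
                R[j] = f(P[i], Q[j], b[i][j], R[j])
    return R


def r_move(f, P, Q, b, R_init):
    """R-move: cell i keeps P_i; R_j (with Q_j) shifts right one cell per step."""
    m, n = len(P), len(Q)
    pipe = [None] * m                      # (j, partial R_j) held by each cell
    out = [None] * n
    for t in range(m + n - 1):
        pipe = [(t, R_init[t]) if t < n else None] + pipe[:-1]
        for i in range(m):
            if pipe[i] is not None:
                j, r = pipe[i]
                pipe[i] = (j, f(P[i], Q[j], b[i][j], r))
        if pipe[-1] is not None:           # R_j leaves the last cell fully updated
            j, r = pipe[-1]
            out[j] = r
    return out
```

Under these assumptions both schedules apply f to each
R_j in the order i = 0, 1, ..., m-1, so they agree with
the direct evaluation for any f, including ones that are
not associative or commutative.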


The complexity and the configuration of the
systolic array depend on the complexity of the
function f and the generation procedure of the
b_{ij}'s. Some regularity and dependence among the
b_{ij}'s may greatly simplify the whole system.


III. MAPPING INTO FAN-IN TYPE 
LINEAR SYSTOLIC ARRAY 


Note that for the two linear systolic arrays
shown in Figures 1 and 2, the input bandwidth and
storage requirements are large in comparison to the
number of processors in the array, which may be
either infeasible or inefficient for many
applications of interest. This is mainly because
the dependence among the b_{ij}'s is not exploited,
so each processor needs its own external input
connection in order to receive all of the b_{ij}'s.
It is expected that under certain
circumstances not all of these external input
connections are required. In this paper, we are
therefore also interested in reducing the
input bandwidth and storage requirements by showing
under what conditions these external input
connections can be removed, so that only the very
first processor has such a connection, i.e., the
input sequences can only be fanned in through the
systolic array. It is shown that the existence of
certain patterns of dependence among the b_{ij}'s
allows them to be generated by fan-in, by slightly
modifying the operations involved in each processor,
without losing the adjacent-neighbor
interconnection structure. These conditions are
given in the following two procedures.


Procedure 2: For the R-stay linear systolic
array, if b_{ij} can be determined through the
following dependence equation

    b_{ij} = T(b_{i-1,j}, b_{i,j-1}; u_i, v_j)        (2)

where u_i is a variable which depends only on i, v_j
is a variable which depends only on j, and T is a
function of four variables, then b_{ij} can be
generated by the fan-in scheme systolic array as
shown in Figure 3 rather than being broadcast as
shown in Figure 1. Also note that b_{-1,j} as well as
v_j, which depend only on j, can be preloaded in
the j-th processor, and b_{i,-1} as well as u_i, which
depend only on i, can be used as a fanned-in input
sequence. □
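
The fan-in generation described in Procedure 2 can be
sketched as follows. This is a minimal Python model under
stated assumptions: each cell j holds a register B
(initially b_{-1,j}) and a preloaded v_j, while the pair
(u_i, b_{i,j-1}) travels to the right one cell per step,
with b_{i,-1} entering at the first cell. The names
fan_in_r_stay, direct_table, b_col, and b_row are
hypothetical and are introduced only for this example.

```python
def direct_table(T, u, v, b_col, b_row):
    """Fill b_ij from equation (2) in plain row-major order (reference)."""
    m, n = len(u), len(v)
    b = [[None] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            up = b[i - 1][j] if i > 0 else b_row[j]      # b_{i-1,j}
            left = b[i][j - 1] if j > 0 else b_col[i]    # b_{i,j-1}
            b[i][j] = T(up, left, u[i], v[j])
    return b


def fan_in_r_stay(T, u, v, b_col, b_row):
    """Generate b_ij inside the array; only cell 0 has an external input."""
    m, n = len(u), len(v)
    B = list(b_row)                  # register B of cell j, holds b_{i-1,j}
    link = [None] * n                # (i, u_i, b_{i,j-1}) arriving at cell j
    table = [[None] * n for _ in range(m)]
    for t in range(m + n - 1):
        # shift the fanned-in data one cell to the right
        link = [(t, u[t], b_col[t]) if t < m else None] + link[:-1]
        for j in range(n):
            if link[j] is not None:
                i, ui, left = link[j]
                B[j] = T(B[j], left, ui, v[j])   # new b_ij overwrites b_{i-1,j}
                table[i][j] = B[j]
                link[j] = (i, ui, B[j])          # forwarded as b_{i,j} to cell j+1
    return table
```

Each cell reads only its own registers and the value
handed over by its left neighbor, which is the
adjacent-neighbor constraint that equation (2) encodes.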

Note that for the R-stay linear systolic array
shown in Figure 1, if b_{ij} is the current input to
the j-th processor, then b_{i-1,j} is the previous
input to the j-th processor and b_{i,j-1} is the
previous input to the (j-1)-st processor. It is
understandable that, in order to avoid violating
the adjacent-neighbor interconnection structure,
b_{ij} can only depend on b_{i-1,j} and b_{i,j-1}, as
well as the data that can be preloaded and the data
that can be fanned in, which is what Procedure 2
states. In general, the systolic array shown in
Figure 3 has two sets of input data. One of them
consists of three fanned-in data sequences, P_i, u_i,
and b_{i,-1}, which depend only on the i index, and
the other set consists of three preloaded data
sequences, Q_j, v_j, and b_{-1,j}, which depend only on
the j index, where u_i, v_j, b_{i,-1}, and b_{-1,j} are
used to generate all the b_{ij}'s. For each processor,
four registers are required, namely Q, V, B, and
R, where registers Q and V are used to store the
preloaded data Q_j and v_j, respectively. Initially,
register B is loaded with b_{-1,j} and register R is
set to R_j^{(-1)}, both of which will be updated once
the systolic array starts operation. So many data
sequences are included in order to cover
the general case. However, it is expected that in
many applications not all of these fanned-in and
preloaded data sequences are required. It is often
the case that the fan-in generation of b_{ij}
depends on only two or three data sequences, which
can be either fanned in or preloaded. Very similar
results can be obtained for the R-move linear
systolic array, as follows.


Procedure 3: For the R-move linear systolic
array, if b_{ij} can be determined through the
following dependence equation

    b_{ij} = T(b_{i-1,j}, b_{i,j-1}; u_i, v_j)        (3)

where u_i is a variable which depends only on i, v_j
is a variable which depends only on j, and T is a
function of four variables, then b_{ij} can be
generated by the fan-in scheme systolic array as
shown in Figure 4 rather than being broadcast as
shown in Figure 2. Also note that b_{i,-1} as well as
u_i, which depend only on i, can be preloaded in
the i-th processor, and b_{-1,j} as well as v_j, which
depend only on j, can be used as a fanned-in input
sequence. □


Note that for the R-move linear systolic array
shown in Figure 2, if b_{ij} is the current input to
the i-th processor, then b_{i,j-1} is the previous
input to the i-th processor and b_{i-1,j} is the
previous input to the (i-1)-st processor. Procedure 3
simply restates the fact that, in
order to avoid violating the adjacent-neighbor
interconnection structure, b_{ij} can only depend on
b_{i-1,j} and b_{i,j-1}, as well as the data that can be
preloaded and the data that can be fanned in. In
general, the systolic array shown in Figure 4 has
two sets of input data. One of them consists of
three fanned-in data sequences, Q_j, v_j, and b_{-1,j},
which depend only on the j index, and the other set
consists of three preloaded data sequences, P_i, u_i,
and b_{i,-1}, which depend only on the i index, where
u_i, v_j, b_{i,-1}, and b_{-1,j} are used to generate all
the b_{ij}'s. For each processor, three registers are
required, namely U, B, and P, where registers P and
U are used to store the preloaded data P_i and u_i.
Initially, register B is loaded with b_{i,-1} and the
output data R_j is set to R_j^{(-1)}, both of which will
be updated once the systolic array starts operation.


The previous three procedures provide a rather
systematic approach to designing the systolic array
architecture for the implementation of a given
problem. First, by checking the existence of
the recurrence relationship shown in equation
(1), we are able to know whether there exist
systolic arrays of the kind shown in Figures 1 and 2.
Next, by checking the dependence among the b_{ij}'s as
shown in equations (2) and (3), we are able to know
whether there exist fan-in type systolic arrays of
the kind shown in Figures 3 and 4, for which only
small input bandwidth and storage are required. The
key issue is how to search for the recurrence
function f and the dependence function T. It is
expected that several different forms of these
functions may exist, owing to the different possible
ways of formulating a given problem. The various
forms of these functions simply create many
different but equivalent configurations of systolic
arrays. Also note that in the previous discussion
P, Q, b, u, and v are treated as single variables;
however, it is clear that they can be sets of
variables and the same results still hold. This
approach can be applied to design systolic arrays
for many interesting practical problems, and
various new configurations of systolic arrays can
be derived. In the next section, we
illustrate this design approach by considering the
DFT algorithm.

IV. SYSTOLIC ARRAY ARCHITECTURE
FOR DISCRETE FOURIER TRANSFORM


Given n discrete data a_i in the time domain,
where 0 ≤ i ≤ n-1, and n discrete frequencies
W_j = (e^{i2π/n})^j in the frequency domain, where
0 ≤ j ≤ n-1, the discrete Fourier transform (DFT) is
to compute

    y_j = a_{n-1} W_j^{n-1} + a_{n-2} W_j^{n-2} + ... + a_1 W_j + a_0.

Let

    f(P, Q, b; R) = (R × b) + P.

By induction, it can be shown that by letting

    y_j^{(i)} = y_j^{(i-1)} W_j + a_{n-i-2}        (4)

and y_j^{(-1)} = a_{n-1}, then y_j^{(n-2)} = y_j is the
required output. The existence of the recurrence
function f and the satisfaction of the recurrence
relationship guarantee that there exist systolic
arrays for the implementation of the discrete Fourier
transform, as shown in Figures 5 and 6.
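
The polynomial-evaluation (Horner) form of the DFT can be
checked numerically with the short Python sketch below.
It assumes the recurrence reads
y_j^{(i)} = y_j^{(i-1)} W_j + a_{n-i-2} with
y_j^{(-1)} = a_{n-1}, and it keeps the positive exponent
of W_j = (e^{i2π/n})^j used in the text. The function
names are illustrative.

```python
import cmath


def dft_direct(a):
    """Direct sum y_j = sum_k a_k * W_j^k with W_j = exp(2*pi*i*j/n)."""
    n = len(a)
    return [
        sum(a[k] * cmath.exp(2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


def dft_horner(a):
    """Same values via f(P, Q, b; R) = R*b + P applied n-1 times."""
    n = len(a)
    W = [cmath.exp(2j * cmath.pi * j / n) for j in range(n)]
    y = [a[n - 1]] * n                       # y_j^(-1) = a_{n-1}
    for i in range(n - 1):                   # i = 0, ..., n-2
        P_i = a[n - i - 2]
        y = [y[j] * W[j] + P_i for j in range(n)]   # b_ij = W_j for every i
    return y                                 # y_j^(n-2)
```

For a = [1, 2, 3, 4] the j = 0 entry is the plain sum
4·1 + 3 = 7, 7·1 + 2 = 9, 9·1 + 1 = 10, which matches the
direct definition.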


It can be seen from Figures 5 and 6 that the
b_{ij}'s are not totally independent. Note that
P_i = a_{n-i-2} and b_{ij} = W_j. In order to see
whether b_{ij} can be generated by fan-in, let us
examine the data dependence among the b_{ij}'s. Many
different forms of the dependence function T exist.
For example,

    b_{ij} = T(b_{i-1,j}, b_{i,j-1}; u_i, v_j) = v_j        (5)

where v_j = W_j. The pair of systolic arrays based
on equations (4) and (5) are shown in Figures 7 and
8. The systolic array shown in Figure 8 is the
well-known systolic DFT [2], whose discovery
appears to have been heuristic rather than
systematic as in our approach. For another example
of the T function, note that

    b_{ij} = W_j = W_{j-1} W_1 = b_{i,j-1} W_1,

i.e.,

    b_{ij} = T(b_{i-1,j}, b_{i,j-1}; u_i, v_j) = b_{i,j-1} u_i        (6)

where u_i = W_1 and b_{i,-1} = W_1^{-1}, which can be
either used as fanned-in sequences of the R-stay
linear systolic array or preloaded in the i-th
processor of the R-move linear systolic array. The
pair of systolic arrays based on equations (4) and
(6) are shown in Figures 9 and 10.
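
To make the R-move configuration with fan-in generated
frequencies concrete, here is a minimal Python sketch. It
assumes the Horner recurrence with P_i = a_{n-i-2}, and
the dependence rule b_{ij} = b_{i,j-1} u_i with
u_i = W_1 and b_{i,-1} = W_1^{-1}. Each cell is modelled
as a small dictionary with registers P, U, and B; this
representation, and the sequential rather than
cycle-accurate loop, are simplifications made for the
example.

```python
import cmath


def dft_r_move_fan_in(a):
    """R-move DFT in which every cell regenerates W_j locally from W_1."""
    n = len(a)
    W1 = cmath.exp(2j * cmath.pi / n)
    # Cell i is preloaded with P = a_{n-i-2}, U = W_1 and B = W_1^{-1}.
    cells = [{"P": a[n - i - 2], "U": W1, "B": 1 / W1} for i in range(n - 1)]
    out = []
    for j in range(n):              # y_0, y_1, ... enter the array in this order
        y = a[n - 1]                # initial value y_j^(-1)
        for cell in cells:          # y_j visits cells 0, 1, ..., n-2 in turn
            cell["B"] *= cell["U"]  # B becomes W_1^j = W_j on the j-th visit
            y = y * cell["B"] + cell["P"]
        out.append(y)
    return out
```

Because every cell sees y_0, y_1, ... in order, updating
B once per visit is enough to reproduce W_j without any
external input for the frequencies.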


Another interesting point is that the type of
function f used in this example does not belong to
the class of general matrix-vector multiplication.
This confirms that the class of problems
covered by Procedure 1 contains more than
the class of general matrix-vector multiplication.
As is well known, there are two different ways to
consider the discrete Fourier transform. One shows
that the DFT is a special case of the evaluation of
a polynomial, and the other shows that the DFT is a
special case of general matrix-vector
multiplication. The first way was just considered
in this example. Let us see what can be obtained
by following the second way. Let

    f(P, Q, b; R) = R + (P × b).

By induction, it can be shown that by letting

    y_j^{(i)} = y_j^{(i-1)} + a_i W_j^i        (7)

and y_j^{(-1)} = 0, then y_j^{(n-1)} = y_j is the
required output. The existence of the recurrence
function f and the satisfaction of the recurrence
relationship guarantee that there exist systolic
arrays for the implementation of the DFT, as shown in
Figures 11 and 12.
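
The matrix-vector view of the DFT can be sketched in
Python as follows, assuming the recurrence
y_j^{(i)} = y_j^{(i-1)} + a_i W_j^i with y_j^{(-1)} = 0,
so that P_i = a_i and b_{ij} = W_j^i. The function name
is illustrative, and the b_{ij} are computed here by
direct exponentiation rather than by fan-in.

```python
import cmath


def dft_accumulate(a):
    """DFT via f(P, Q, b; R) = R + P*b, accumulating one a_i per step."""
    n = len(a)
    W = [cmath.exp(2j * cmath.pi * j / n) for j in range(n)]
    y = [0j] * n                                  # y_j^(-1) = 0
    for i in range(n):                            # i = 0, ..., n-1
        y = [y[j] + a[i] * W[j] ** i for j in range(n)]   # b_ij = W_j^i
    return y                                      # y_j^(n-1)
```

Unlike the Horner form, this formulation needs a
different b_{ij} for every pair (i, j), which is why the
dependence among the b_{ij}'s matters more here.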


From Figures 11 and 12 it can also be seen that
the b_{ij}'s are not totally independent. Note that
P_i = a_i and b_{ij} = W_j^i. Let us examine the data
dependence among the b_{ij}'s. Note that

    b_{ij} = W_j^i = W_i^j = W_i^{j-1} W_i = b_{i,j-1} W_i,

i.e.,

    b_{ij} = T(b_{i-1,j}, b_{i,j-1}; u_i, v_j) = b_{i,j-1} u_i        (8)

where u_i = W_i and b_{i,-1} = W_i^{-1}, which can be
either used as fanned-in sequences of the R-stay
linear systolic array or preloaded in the i-th
processor of the R-move linear systolic array. The
pair of systolic arrays based on equations (7) and
(8) are shown in Figures 13 and 14. Also note that

    b_{ij} = W_j^i = W_j^{i-1} W_j = b_{i-1,j} W_j,

i.e.,

    b_{ij} = T(b_{i-1,j}, b_{i,j-1}; u_i, v_j) = b_{i-1,j} v_j        (9)

where v_j = W_j and b_{-1,j} = W_j^{-1}, which can be
either preloaded in the j-th processor of the R-stay
linear systolic array or used as fanned-in sequences
of the R-move linear systolic array. The pair of
systolic arrays based on equations (7) and (9) are
shown in Figures 15 and 16.
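
As a numerical check, the two dependence rules for
b_{ij} = W_j^i can be compared in Python. The sketch
assumes the row rule b_{ij} = b_{i,j-1} W_i with
b_{i,-1} = W_i^{-1} and the column rule
b_{ij} = b_{i-1,j} W_j with b_{-1,j} = W_j^{-1}; the
function names are hypothetical.

```python
import cmath


def twiddles_by_row_rule(n):
    """Build b_ij along j: start from W_i^{-1} and multiply by u_i = W_i."""
    W = [cmath.exp(2j * cmath.pi * k / n) for k in range(n)]
    table = []
    for i in range(n):
        b = 1 / W[i]                 # b_{i,-1}
        row = []
        for j in range(n):
            b = b * W[i]             # b_ij = b_{i,j-1} * u_i
            row.append(b)
        table.append(row)
    return table


def twiddles_by_column_rule(n):
    """Build b_ij along i: start from W_j^{-1} and multiply by v_j = W_j."""
    W = [cmath.exp(2j * cmath.pi * k / n) for k in range(n)]
    B = [1 / W[j] for j in range(n)]             # b_{-1,j}
    table = []
    for i in range(n):
        B = [B[j] * W[j] for j in range(n)]      # b_ij = b_{i-1,j} * v_j
        table.append(list(B))
    return table
```

Both functions should return the same n-by-n table, up to
rounding, whose (i, j) entry is exp(2πi·ij/n).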


This DFT example shows that under certain
circumstances it is possible to formulate a given
problem in several different ways, each of which
leads to a different but equivalent configuration
of systolic arrays.


V. CONCLUDING REMARKS 


A systematic approach is presented for 
designing systolic arrays and deriving their 
equivalent configurations for certain general 
classes of recursively formulated algorithms. This 
approach can be considered as a two-stage design 
procedure. In the first stage, the existence of
recursiveness is investigated. If it exists, then
according to this formulation the input data
are classified into three parts: two of them, P_i
and Q_j, depend on only one index, and the third,
b_{ij}, depends on both indices i and j,
so that the systolic arrays shown in Figures 1 and 2
apply. However, for certain applications, it is
either infeasible or inefficient to store all of
the b_{ij}'s. In the second stage, the dependence
among the b_{ij}'s is investigated to see whether it
can be used to generate the b_{ij}'s by fan-in from
data sequences that can be either preloaded or
fanned in. For a given problem, various
formulations of the recursive property and of the
dependence among the b_{ij}'s are possible, which
lead to many different but equivalent
configurations of systolic arrays.

So far we have mainly dealt with linear systolic
arrays. However, the same technique can be easily
generalized to two-dimensional mesh-connected
systolic arrays, since a mesh-connected systolic
array can be treated simply as the concatenation
of many linear systolic arrays.

Figure 1: The R-stay linear systolic
array.

Figure 2: The R-move linear systolic
array.

Figure 3: The fan-in scheme of R-stay
linear systolic array. Note that the
register B in the j-th processor is
initially loaded with b_{-1,j}.

Figure 4: The fan-in scheme of R-move
linear systolic array. Note that the
register B in the i-th processor is
initially loaded with b_{i,-1}.

Figure 5: R-stay linear systolic array of
discrete Fourier transform based on
equation (4).

Figure 6: R-move linear systolic array of
discrete Fourier transform based on
equation (4).

Figure 7: R-stay linear systolic array of
discrete Fourier transform based on
equations (4) and (5).

Figure 8: R-move linear systolic array of
discrete Fourier transform based on
equations (4) and (5).




























Figure 9: R-stay linear systolic array of
discrete Fourier transform based on
equations (4) and (6).

Figure 10: R-move linear systolic array
of discrete Fourier transform based on
equations (4) and (6). Note that register
U is preloaded with W_1 and register
B is initially loaded with W_1^{-1}.

Figure 11: R-stay linear systolic array
of discrete Fourier transform based on
equation (7).

Figure 12: R-move linear systolic array
of discrete Fourier transform based on
equation (7).

Figure 13: R-stay linear systolic array
of discrete Fourier transform based on
equations (7) and (8).

Figure 14: R-move linear systolic array
of discrete Fourier transform based on
equations (7) and (8). Note that in the
i-th processor, register U is preloaded
with W_i and register B is initially
loaded with W_i^{-1}.

Figure 15: R-stay linear systolic array
of discrete Fourier transform based on
equations (7) and (9). Note that in the
j-th processor, register V is preloaded
with W_j and register B is initially
loaded with W_j^{-1}.

Figure 16: R-move linear systolic array
of discrete Fourier transform based on
equations (7) and (9).




























